Abstract. Texts exhibit considerable stylistic variation. This paper reports an experiment
where a corpus of documents (N = 75 000) is analyzed using various simple stylistic
metrics. A subset (n = 1000) of the corpus has been previously assessed to be relevant 
for answering given information retrieval queries. The experiment shows that this subset 
differs significantly from the rest of the corpus in terms of the stylistic metrics studied. 



1 Introduction 

Texts vary not only by topic. Stylistic variation between texts of the same topic is often at least 
as noticeable as the topical variation between texts of different topic but same genre or variety; 
style is, broadly defined, the difference between two ways of saying the same thing. 

Stylistic variation in a given text, given the liberal definition above, can occur in many ways 
and on many linguistic levels: lexical choice, choice of syntactic structures, choice of cohesion 
markers on a textual level, and so forth. Some choices are constrained by the intended audience 
and discourse ecology the text is produced in; some are left to be entirely determined by 
the author's preferences and personal idiosyncrasies. In this experiment the stylistic variation
investigated is taken from a fairly well-edited body of text - the Wall Street Journal - where 
presumably most writers conform to the expected norms of writing, and if not, the texts will 
be edited by professional editors to conform to them. The focus will be on stylistic variation 
based on the specific genres or functional styles that occur in a daily newspaper - as opposed 
to individual styles [13]. In particular, the experiment investigates how the variation relates to the
usefulness of texts in a large scale information retrieval experiment. 

This experiment makes some simple measurements of style markers, indicative of stylistic 
variation and of genre, on a largish corpus of documents, and compares a subset of documents 
that have previously been judged relevant for answering queries in an information retrieval 
experiment with the rest of the corpus. 



2 Corpus and Statistical Measurements 

The Text REtrieval Conference (TREC), organized in the form of a competition by US gov- 
ernment research agencies, gives participating research organizations access to a large corpus 
of texts and a set of queries that are to be used for retrieving texts from the corpus. The texts 
that the organizations and systems participating in the competition suggest as most relevant 
for a query are read by a number of judges, and assessed as either relevant or not relevant. 
These assessments - the relevance judgements - are used in this experiment to categorize the 
entire corpus. Given a query, the corpus has three types of document: relevant texts, not rele- 
vant texts, and not judged (i.e. not retrieved) texts [6]. In this experiment, if a text was judged 
relevant for any query at all in the test set we will consider the text relevant. 



For this experiment, a part of the TREC test corpus was selected: 74 516 Wall Street Journal
articles from the years 1990, 1991, and 1992. Of these documents, 1116 were marked relevant for
at least one of fifty queries (TREC queries 201-250) and 12 482 were marked non-relevant, that is, judged
but not relevant for any of the queries. 1 Initially, the documents were analyzed to obtain simple
sentence statistics and simple measures of syntactic complexity.



3 Hypotheses 

The hypotheses of the experiment were 1) that certain genres or types of text would be more 
likely to provide the answers the human judges would prefer, and 2) that this preference is clear 
enough to be detectable even using the fairly simple mechanisms tested in this experiment. 

The stylistic variation was expected for two reasons. Firstly, the corpus is likely to
contain materials that will never be useful in a generally framed information retrieval
task such as TREC: stock report tables and the like. Secondly, the human judges,
while well trained for the task, are likely to exhibit biases for certain types of documents, namely
those which are easy to judge as being relevant or not.



4 Results 

The results are positive. Texts that were found relevant did differ systematically from texts 
which were not found relevant; for most metrics tested, the difference was statistically significant
even by univariate tests. 2 In addition, we find that relevant texts and non-relevant texts taken 
together - i.e. texts retrieved by systems participating in the TREC evaluation - differ from 
the rest of the corpus in a systematic manner. The difference between relevant and non-relevant 
texts is much smaller than the difference between either of them and the non-judged portion of 
the corpus. 

In summary, the results of this experiment show that retrieved highly ranked texts - both 
relevant and non-relevant - are longer, with a more complex sentence structure than the rest of 
the corpus, and that relevant texts differ from non-relevant in that they tend to be even more
complex textually. 

1 3493 articles were retrieved for more than one query. Three articles were retrieved for more than 25 
of 50 queries. 

2 Since normal distribution assumptions cannot be expected to hold for language data of this sort, 
we used the Mann Whitney U rank sum test, which makes fewer assumptions about the values 
and distributions of the variables, to test for the significance of detected differences. There are no 
standard tests for multivariate significance for "nonparametric" variables, i.e. variables without an 
approximation of their value space; this means that this experiment will miss interactions between 
variables, testing them one by one, rather than risking a false positive result from using a multivariate 
test based on false assumptions. The Mann Whitney U test is one of two equivalent formulations - the 
other is Wilcoxon's rank sum test - for calculating significant differences between some measurement 
in two sets. The sets are sorted together by the result of the measurement, and the sum of ranks is 
calculated for one of the categories. If there is a difference in the measurement results between the 
categories, the sum will tend to be either high or low; the thresholds for a significant difference from 
the expected average value for the rank sum are calculated using the relative sizes of the sets. 
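
The rank sum procedure described in this note can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the experiment: the function names are hypothetical, the p-value comes from the large-sample normal approximation, and no tie correction is applied to the variance, which is an assumption that matters little for sets as large as the ones compared here.

```python
from math import erf, sqrt


def rank_with_ties(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (U statistic for sample a, two-sided p-value).

    Assumption: normal approximation without tie correction.
    """
    n1, n2 = len(a), len(b)
    # Sort both sets together and sum the ranks of the first set.
    ranks = rank_with_ties(list(a) + list(b))
    rank_sum_a = sum(ranks[:n1])
    u = rank_sum_a - n1 * (n1 + 1) / 2
    expected_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - expected_u) / sd_u
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return u, p
```

Called with, say, the word counts of the relevant documents as `a` and those of the non-relevant documents as `b`, a p-value below 0.05 corresponds to significance at the 95% confidence level used in this paper.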



4.1 Simple statistics: Sentence Length and Word Statistics 

A simple word count reveals that relevant texts on the average are longer than other texts - 
which also has been observed, pointed out, and utilized by the very successful Cornell research 
group at the latest TREC conference [3]. 

Word statistics - word length, long word counts, type/token ratios - as a measure of termi- 
nological complexity have often been paired with sentence length to produce readability scores 
and recently to study variation in various varieties of language, as well as perform genre discrim- 
ination [1, 2, 8, 9]. Table 1 contains a summary of the simple statistics. Relevant texts, besides 
being longer, also have longer sentences. The differences between relevant and non-relevant are 
significant in a Mann Whitney test at the 95% confidence level for all statistics except average
word length. 



Category     Number   Word count   Type-token ratio   Word length   Words per sentence

Relevant       1116          755              0.527          5.08                 19.8
Misses        12482          675              0.551          5.07                 19.3
Not judged    60918          396              0.611          5.03                 19.2

Table 1. Simple statistics for the corpus
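
The measures in Table 1 are simple enough to compute with a few lines of code. The following Python sketch shows one way to do so for a single document; it is not the tool used in the experiment. The function name, the sentence splitting rule, and the definition of a word are all assumptions made for illustration, and different choices will shift the absolute values somewhat.

```python
import re


def simple_stats(text):
    """Compute word count, type-token ratio, word length, words per sentence."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a word is a run of letters, possibly joined by ' or -.
    words = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)
    if not words or not sentences:
        raise ValueError("text contains no words")
    types = {w.lower() for w in words}  # distinct word forms, case-folded
    return {
        "word_count": len(words),
        "type_token_ratio": len(types) / len(words),
        "word_length": sum(len(w) for w in words) / len(words),
        "words_per_sentence": len(words) / len(sentences),
    }
```

Note that the type-token ratio falls as documents grow longer, since words are repeated; this is consistent with the longer relevant documents having the lowest ratio in Table 1.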



Longer texts are more likely to be relevant at least partly due to the fact that longer texts 
range over several topics, and thus there is a chance that a long text will touch a relevant topic. 
In this experiment, we find that not only are relevant documents longer, but all documents 
retrieved by systems, even those assessed by human judges as irrelevant, also are longer than 
the average document. Not only will longer texts touch relevant topics - but apparently they 
may well touch irrelevant but confusingly similar topics. 

The non-retrieved portion of the corpus turns out to contain large numbers of very short 
items, and large numbers of tables and numerical information, both short and long, which the 
retrieval systems have not proffered to the assessors for consideration. These texts presumably 
simply have less topical information, and thus are hit less often by the retrieval systems used.

Running a subtopic segmentation algorithm over a number of texts 3 produces the expected
result. For the experiment we ran a system - TextTiles - which cuts up a text into tiles,
tentative subtopic segments [7]. For the purposes of the experiment, only the number of tiles
was retained. The number of tiles was higher for the relevant documents than for the non-relevant
ones, but when the experiment was rerun on texts categorized by length, we found that
long relevant texts tend to have fewer subtopics than long non-relevant ones, in contrast with the
pattern for documents of all lengths - see Table 2. The difference is better than 95% significant
by Mann Whitney U for the larger set of documents; for the long texts the risk of random results
is around 10%.

4.2 Syntactic complexity 

Syntactic complexity is a dimension which exhibits considerable variation between genres [10, 
11]. Indeed, most stylistic measures heretofore have been attempts to find shortcuts for mea- 
suring syntactic complexity along with lexical complexity as measured by word length and 

3 This experiment ran on a smaller corpus of only relevant and non-relevant documents. 



Category                           Number   Tiles

Documents of all lengths
  Relevant                            756     3.8
  Not relevant                       4406     3.6

Documents over a thousand words
  Relevant                            176     8.1
  Misses                              985     8.8

Table 2. Average number of tiles



type-token ratios. Sentence length as used above is one such method, although arguably a blunt
one - which syntactic constructions are complex in themselves, and when they are instead evidence of
complexity in an already complex subject matter, is a matter of contention [4].

As a somewhat deeper approximation of clause complexity, we will look at the average depth 
of output trees from a robust parser built for information retrieval purposes [12]. The parser 
was set to skip parsing after a timeout threshold, and when it does so, it notes it has done so in 
the parse tree. These skip marks were counted - again, as an indication of clausal complexity. 
We find, as shown in Table 3, a clear distinction between the various categories of document.
Relevant documents have, on average, deeper parse trees and more skips. Both measures show a
significant difference between relevant and non-relevant documents, again by a Mann Whitney
test at better than the 95% confidence level.



Category        Tree depth   Skips

Relevant              10.0   0.499
Non-relevant          9.88   0.456
Not assessed          9.56   0.409

Table 3. Trees and Skips
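
To make the two measures in Table 3 concrete, the following Python sketch computes tree depth and skip counts over a toy tree representation. Everything about the representation is an assumption for illustration: the output format of the parser [12] is not described here, so trees are modelled as nested tuples of the form (label, child, ...), words are plain strings, and a skip is a node with the hypothetical label "SKIP". Averaging per sentence is likewise an assumption.

```python
def tree_depth(node):
    """Depth of a parse tree; a bare word (leaf string) has depth 0."""
    if isinstance(node, str):
        return 0
    _label, *children = node
    return 1 + max((tree_depth(child) for child in children), default=0)


def count_skips(node, skip_label="SKIP"):
    """Number of nodes carrying the (hypothetical) skip label."""
    if isinstance(node, str):
        return 0
    label, *children = node
    own = 1 if label == skip_label else 0
    return own + sum(count_skips(child, skip_label) for child in children)


def document_profile(trees):
    """Average depth and average skips over the sentence trees of a document."""
    n = len(trees)
    average_depth = sum(tree_depth(t) for t in trees) / n
    average_skips = sum(count_skips(t) for t in trees) / n
    return average_depth, average_skips
```

For example, a document whose two sentences parse to a tree of depth 3 and a tree of depth 2 containing one skip mark would get the profile (2.5, 0.5).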



5 Defining Genres 

Stylistic variation, as indicated above, is partly an effect of genre variation. To get closer to the 
genres one can expect to find in the corpus text, one issue of the Wall Street Journal (910102) was
categorized manually into ten rough categories: articles, business news with and without tables,
lists of paragraph-length items, editorials, letters, paragraph-length items, "What's News"
(menu-type lists of one-sentence items), tables, and single one-sentence items. This was used
as a training set to categorize the entire corpus: simple stylistic measurements for the
hand-categorized data - as shown in the previous section 4 - were used in a discriminant analysis, and
the resulting discriminant functions were used to automatically categorize the entire corpus. 
The details of the method are not important: the result is sloppy in any case. No checking was 



4 With one extra measure added: digits per character, multiplied by 1000. 



Genre       N  Category       Count  Share    Tree  Skips  Words  Types/  Chars/  Digits/  Words/
                                              depth                Tokens    Word   kChars    Sent

A      11 331  Relevant         330           [values illegible in source]
               Not Relevant    3094           10.0  0.511    988   0.476    5.05     3.01    18.9
               Not Judged      7907           9.86  0.498    782   0.503    4.99     3.44    18.2

B         209  Relevant           5           [values illegible in source]
               Not Relevant      70           8.40  0.341   3933   0.346    4.82     21.5    17.7
               Not Judged       134           8.20  0.165   1481   0.335    4.62     38.9    27.1

C      13 669  Relevant         309           [values illegible in source]
               Not Relevant    2516           9.87  0.482    656   0.504    5.09     7.73    20.7
               Not Judged     10844           9.44  0.459    528   0.516    5.00     10.4    20.5

D        6006  Relevant         124           [values illegible in source]
               Not Relevant    1278           9.53  0.477   1075   0.484    4.94     4.89    17.9
               Not Judged      4604           9.38  0.464    835   0.514    4.90     6.00    17.8

E        2613  Relevant          49           [values illegible in source]
               Not Relevant     604           9.91  0.503   1228   0.446    4.90     3.06    18.9
               Not Judged      1960           9.95  0.499    855   0.486    4.86     3.24    18.7

F        3187  Relevant          48           [values illegible in source]
               Not Relevant     707           10.1  0.484    503   0.577    5.20     3.49    19.8
               Not Judged      2432           9.78  0.434    367   0.600    5.12     4.53    19.6

G      21 941  Relevant         183           [values illegible in source]
               Not Relevant    2526           9.90  0.409    189   0.644    5.17     7.29    20.3
               Not Judged     19232           9.55  0.388    169   0.651    5.10     8.73    19.6

H        3539  Relevant          21   0.5%    9.24  0.397    535   0.588    4.91     21.1    13.7
               Not Relevant     490           8.83  0.402    643   0.543    4.85     27.0    11.7
               Not Judged      3028           8.17  0.331    467   0.566    4.83     27.1    14.5

I        1096  Relevant           6   0.5%    9.12  0.460    377   0.603    4.35     51.1    18.7
               Not Relevant     145           8.33  0.275    677   0.610    4.29     77.2    17.0
               Not Judged       945           7.12  0.150    250   0.691    4.67     65.2    24.9

J      10 925  Relevant          41   0.3%    10.1  0.476    107   0.743    5.23     6.31    22.4
               Not Relevant    1052           10.4  0.330     75   0.800    5.24     7.25    20.2
               Not Judged      9832           10.1  0.328     70   0.805    5.15     8.14    19.5

Relevant counts for genres A-G are obtained as N minus the other two counts.

Table 4. Clusters based on stylistic data, and their proportions of relevant documents



made to see how well and consistently the articles were categorized in the genres given; the idea 
was simply to have a seed set to cluster the documents around. In Table 4 some statistics for 
each category are shown; the category names have been replaced with letters so as not to imply 
the categories are consistent with real life genres. 5 
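
The seed-and-categorize step can be illustrated with a deliberately simplified stand-in. The sketch below does not implement discriminant analysis: it standardizes the stylistic features on the hand-labelled seed set and assigns each document to the nearest category centroid, which is a cruder technique swapped in to keep the example short. All function names, feature choices, and example values are hypothetical; the digits measure follows the definition of digits per character multiplied by 1000, with the assumption that all characters, including spaces, are counted.

```python
from statistics import mean, pstdev


def digits_per_kchar(text):
    """Digits per character, multiplied by 1000 (all characters counted)."""
    if not text:
        raise ValueError("empty text")
    return 1000 * sum(ch.isdigit() for ch in text) / len(text)


def scale(vector, means, sds):
    """Standardize one feature vector with seed-set means and deviations."""
    return [(x - m) / s for x, m, s in zip(vector, means, sds)]


def fit(seed_vectors, seed_labels):
    """Return (means, sds, centroids) estimated from the labelled seed set."""
    columns = list(zip(*seed_vectors))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # guard constant features
    grouped = {}
    for vector, label in zip(seed_vectors, seed_labels):
        grouped.setdefault(label, []).append(scale(vector, means, sds))
    centroids = {
        label: [mean(col) for col in zip(*rows)]
        for label, rows in grouped.items()
    }
    return means, sds, centroids


def classify(vector, model):
    """Assign a document to the category with the nearest centroid."""
    means, sds, centroids = model
    z = scale(vector, means, sds)
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(z, centroids[label])),
    )
```

With feature vectors such as [words per sentence, digits per thousand characters], a model fitted on one hand-categorized issue can then be applied to every document in the corpus, in the same spirit as the clustering reported in Table 4.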

The hypothesis was that a simple stylistic clustering might well prove useful thanks to its 
anchoring in genre, and in spite of this anchoring being quite tentative. The table shows that 
there are considerable differences between the categories in stylistic metrics - unsurprisingly, 
since they have been clustered to maximise that difference - but more importantly, the categories 
show considerable differences in how large a proportion of the documents are relevant, and most 
importantly, in how the relevant documents differ from the non-relevant ones stylistically. For
instance, whereas in category A, relevant documents will have longer sentences on average than 
non-relevant and non-retrieved documents, in categories C and H the relevant documents will 
have shorter sentences; and whereas most categories prefer documents with a low type-token 
ratio, category H prefers documents with a high ratio. 6 

6 Conclusions 

Texts differ in style. In this experiment, automatically retrieved texts differed from non-retrieved
texts along several simple stylistic metrics. This shows that either 1) retrieval mechanisms are
biased for style, or more likely, 2) style and topic go hand in hand. Neither of these results is
surprising. Nonetheless, they may be a useful point to note for information retrieval system 
designers. 

What is more interesting, and a good starting point for user-oriented information retrieval
studies, is utilizing this type of measure in distinguishing interesting texts from less interesting
ones. This will entail analyzing the tasks and expectations of users; this experiment shows that
for a certain set of users and for a certain scenario a clear bias towards certain types or genres
of text can be found, namely the bias that separates relevant from non-relevant texts in the experiment.

The experiment also shows that stylistically determined genres or functional styles are dif- 
ferent as regards potential usefulness for the queries tested, and that the distinctions between 
relevant and non-relevant differ between genres. 

The differences between relevant and non-relevant texts found should not be taken as general 
results: while useful in a TREC context, as shown by the results from Cornell, they are clearly 
an effect of the task, corpus, and assessors. These results should be taken as a starting point in 
investigating how situations affect measures of stylistic variation. 